Thai Word Segmentation with a Brain-Inspired Sparse Distributed Representations Learning Memory
Computational Intelligence and Neuroscience

Word segmentation is necessary for many natural language processing tasks, especially for languages such as Thai in which words are written without explicit boundaries. Wrong segmentation severely degrades the final result. In this study, we propose two new brain-inspired methods based on Hawkins' approach to address Thai word segmentation. Sparse Distributed Representations (SDRs), which model how the neocortex stores and transfers information, are used as the underlying memory. The first proposed method, THDICTSDR, improves the dictionary-based approach by using SDRs to learn the surrounding context and combining them with n-grams to select the correct word. The second method, THSDR, uses SDRs instead of a dictionary. The methods are evaluated on the BEST2010 and LST20 standard word segmentation datasets and compared with longest matching, newmm, and Deepcut, the state-of-the-art deep learning approach. The results show that the first method is significantly more accurate than the other dictionary-based methods. It achieves an F1-score of 95.60%, comparable to the state-of-the-art Deepcut F1-score of 96.34%. When all vocabulary has been learnt, it achieves a better F1-score of 96.78%. In addition, when all sentences have been learnt, it achieves an F1-score of 99.48%, beyond Deepcut's 97.65%. The second method is fault-tolerant to noise and gives better overall results than deep learning in all cases.


Introduction
Natural language processing (NLP) applications have grown rapidly, for example, sentiment analysis, information retrieval, text classification, machine translation, speech recognition, and question answering. Some approaches require text to be split into words before downstream processing, for instance, when word embeddings are used for classification. English, which uses the Latin script, is easily tokenized into words by observing delimiter characters such as spaces, semicolons, commas, quotes, and periods. Unsegmented languages such as Thai, Chinese, Japanese, and Korean do not have explicit word boundaries that can serve as delimiters. They require a specialized algorithm to find word boundaries before further processing.
Thai word segmentation has been developed since 1981 and is divided into three types [1], namely, rule-based, dictionary-based, and learning-based techniques. The rule-based technique relies on hard-coded rules. However, the language is complex, so hand-coded rules cover only part of it, and unknown words can require new rules. The dictionary-based technique uses a set of words from dictionaries and looks up series of characters in the dictionary to find matches. Its performance depends on the dictionary's size, the approach to handling unknown words, and the ambiguity that arises when multiple ways to segment a text are found. The easiest way to fix the ambiguity problem is to select the longest word. However, the performance is still low because a shorter word might be the correct one. Thus, another approach, found mainly in the learning-based technique, takes the context of a word into account.
The learning-based technique learns from text in which word boundaries are marked explicitly and uses machine learning algorithms to build a model. The approaches include the Hidden Markov Model (HMM) [2], Conditional Random Fields (CRF) [3], and deep learning, which is the current state of the art, such as Deepcut [4] and Attacut [5]. The advantage of this technique is that it does not require dictionaries. The unknown word and ambiguity problems can be handled by using statistical characteristics. However, the drawback is that it requires a training dataset and depends on the domain used for training, the size of the training data, and the labelling of boundaries, which is a laborious task that takes time and effort.
This research proposes two new methods. The first method, called THDICTSDR, is based on a dictionary and learns by using SDRs in combination with n-grams. The method adapts Hawkins's approach to the Thai word segmentation problem by using Sparse Distributed Representations (SDRs), proposes a new encoder for Natural Language Processing (NLP), and combines it with n-grams. The results show that a neuroscience approach can also produce accuracy comparable to the state-of-the-art deep learning approach. The second method, THSDR, uses SDRs instead of a dictionary to find words. This approach improves fault tolerance to noise, which is particularly useful in applications where the data are not precleaned, such as Thai OCR.
It is important to note that this research builds upon a previous study [6] that introduced a new brain-inspired approach to address spelling check problems. That approach has been demonstrated to yield better results than deep learning methods, and it is also fault-tolerant to noise. Additionally, it is not limited to spelling check problems but can also be applied to Thai word segmentation, as in this research. Consequently, the research offers the following benefits: (1) There is no requirement to learn further from training data. The previous study demonstrates that when the dataset includes more noise, the performance of deep learning models suffers. With deep learning models, learning a new error model requires retraining, which can be difficult to accomplish while an application is running. Learning only from a dictionary or from correct words is a simpler and more feasible alternative. (2) SDRs operate on bits, which makes them faster than numerical operations. For example, when finding similarities in a dictionary with 700,000 vocabulary entries, the processing time using numerical operations is around 13 seconds, whereas with SDRs it is only 0.05 seconds.
In summary, this research makes the following contributions: (1) It introduces a new method that combines the dictionary-based and learning-based approaches and achieves accuracy comparable to the state of the art (Deepcut), even outperforming it in some cases. (2) It proposes a new all-SDR approach that is capable of handling noise in situations where no training data exist for word boundaries. This approach is especially relevant for Thai OCR, which faces challenges associated with both spelling checks and word boundary identification simultaneously. (3) It demonstrates the advantages of the proposed method, including the ability to learn quickly when new vocabulary is provided, which is not the case for the state of the art, which requires training data with explicit word boundaries to learn a new error model.
This paper is organized as follows: Section 2 presents related works, some of which are used to evaluate the new method. Section 3 provides inspiration and background for this research. Section 4 explains the concept of the new proposed approaches. Next, Section 5 demonstrates the evaluation method and results. Finally, in Section 6, a summary of this research is presented.

Related Works
The first rule-based method [7] was created based on Thai grammar, and [8] improves the rules by using Thai spelling principles, with many rules applied in combination. The first dictionary-based method [9] was proposed in 1986. It combines a dictionary with the longest matching technique, which selects the word with the longest length. If the rest of the sentence cannot be segmented after a selection, the method backtracks and tries the next longest word. However, it fails if the correct word is not the longest one or if multiple unknown words are found. The maximal matching algorithm [10] was proposed to cope with the longest matching problem by finding all possible segmentations of a sentence and choosing the one that contains the fewest words. Nevertheless, finding all possible words is a brute-force method, and for a long sentence many candidates are generated. Moreover, the method does not guarantee that the correct segmentation is selected, and it cannot determine the best one if several candidates have the same number of words.
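The two dictionary-based strategies described above can be illustrated with a minimal Python sketch. It is not the implementation of [9] or [10]; the toy English dictionary, the function names, and the brute-force enumeration are assumptions chosen only to show the backtracking step and the tie problem.

```python
def longest_matching(text, dictionary):
    """Greedy longest matching with backtracking; returns a word list or None."""
    max_len = max(len(word) for word in dictionary)

    def solve(start):
        if start == len(text):
            return []
        # Try the longest candidate first, then progressively shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            word = text[start:end]
            if word in dictionary:
                rest = solve(end)
                if rest is not None:  # the remainder could be segmented
                    return [word] + rest
        return None  # dead end: the caller backtracks to a shorter word

    return solve(0)


def all_segmentations(text, dictionary):
    """Enumerate every dictionary segmentation (brute force, exponential)."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        if text[:end] in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([text[:end]] + rest)
    return results


def maximal_matching(text, dictionary):
    """Return every segmentation that uses the fewest words (ties are kept)."""
    candidates = all_segmentations(text, dictionary)
    if not candidates:
        return []
    fewest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == fewest]


# Hypothetical toy dictionary, used only for illustration.
toy_dictionary = {"the", "them", "theme", "mend", "end"}
print(longest_matching("themend", toy_dictionary))
print(maximal_matching("themend", toy_dictionary))
```

In this toy case, longest matching first tries "theme", fails on the remainder "nd", and backtracks to "them" + "end". Maximal matching finds two segmentations with two words each, so the fewest-words criterion alone cannot choose between them.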
Another improvement of the dictionary-based approach is the trie structure that was proposed in 1991 [11]. Instead of keeping all words in a list, the method creates a tree-like structure to reduce the storage size and finds words by moving from node to node, one character at a time. Thus, finding a word via the trie is faster than finding it in a list. TLS-ART-MC [1] proposes a combination of a ranking trie, Soundex, and two-pass segmentation. The ranking trie sorts the character nodes by their frequency. A node with a higher frequency is placed closer to the root node and is found first. Thus, the trie structure becomes smaller and performs faster. Soundex is used to cope with misspelling problems. It was designed for searching for an expected name with a different spelling. A word is converted to a code using rules; thus, the same code means the same word. Once the text has passed segmentation by the trie and Soundex, the segmented words are combined using rules based on Thai grammar.
Another approach is learning-based and uses a statistical technique, a Viterbi-based method [2], to employ statistical information derived from grammatical tags. The idea is to find the path that gives the maximum probability. Moreover, [12,13] propose using word trigrams with the part of speech. However, the method only captures a state and its corresponding words, whereas the state might depend on something other than its following words or other adjacent words. In machine learning research, word segmentation is treated as a classification problem that assigns each character in the string to one of two classes, with the beginning of a word labelled "B" and an intraword character labelled "I." These labelled characters and segmented words are used to train machine learning algorithms.
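The "B"/"I" character labelling can be made concrete with a short Python sketch. The function names are hypothetical, and characters are taken to be Unicode code points, so Thai vowel and tone marks each receive their own label.

```python
def words_to_labels(words):
    """Label the first character of each word "B" and the others "I"."""
    labels = []
    for word in words:
        if word:  # skip empty strings
            labels.append("B" + "I" * (len(word) - 1))
    return "".join(labels)


def labels_to_words(text, labels):
    """Rebuild the word list from an unsegmented text and its B/I labels."""
    words = []
    for ch, label in zip(text, labels):
        if label == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # continue the current word
    return words


english = ["ice", "cream"]
print(words_to_labels(english))

thai = ["แมว", "กิน", "ปลา"]
labels = words_to_labels(thai)
print(labels_to_words("".join(thai), labels))
```

A classifier trained on such labels predicts one label per character, and the segmentation is recovered by cutting before every "B".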
A comparative study [14] compares four learning-based algorithms, namely, Naïve Bayes (NB), Decision Tree, Support Vector Machine (SVM), and Conditional Random Field (CRF), with longest matching and maximal matching. The result shows that the dictionary-based algorithms perform better than NB, Decision Tree, and SVM. However, the best result is obtained by the CRF algorithm.
Instead of using words as the base unit, some research uses syllables or smaller groups of characters, which yields better performance. In [15], the author uses two processes, namely, syllable segmentation and syllable merging. Firstly, the research defines about 200 syllable patterns and trigram statistics for syllable segmentation. Then, syllables are merged by finding possible word sequences from the dictionary and selecting the best words by maximum collocation strength. In [3], the authors propose using minimum text units to extract the smallest units that constitute words, then using CRF to identify syllables, and finally merging syllables with a set of rules. Another study proposes grouping contiguous Thai characters into inseparable units called Thai Character Clusters (TCCs) [16], which are defined by a set of rules. However, that research targets information retrieval, where it improves search accuracy.
The state of the art in word segmentation uses deep learning, which can learn from data and errors. A popular Thai word segmenter is Deepcut [4], which uses convolutional neural networks (CNNs) and achieves an accuracy of 96.57% and an F1 of 96.34% in its experiment. An improvement of Deepcut is Attacut [5], which uses syllable embeddings as features together with character embeddings. Although its accuracy and F1 do not exceed those of Deepcut, it is at least 5.6 times faster than Deepcut. In [17], the authors perform word segmentation and POS tagging with a joint model of both tasks to improve overall accuracy. The word boundaries are produced first and then become the input for tagging. The input consists of character-level n-grams, and the next layer incorporates the n-gram features with their surrounding contexts by using bidirectional recurrent neural networks (RNNs). The reported F1-scores are above 90%.

Inspiration and Background
AI has improved dramatically because of a machine learning approach called Deep Learning (DL), which produces impressive results by relying on backpropagation and mathematical optimization models. However, it is still far from the goal of creating AI at the level of human intelligence, and it is unclear whether the current AI approaches can lead to this goal. Thus, instead of focusing on the current approaches, this paper studies the neuroscience approach and uses a solution called a brain-inspired method.
Deep learning also replicates a function of neurons in the human brain and has been steadily evolving since 1957. However, its current behaviour is still not much different from how it started: it learns by adjusting the weights between neurons. This is in contrast to neuroscience, which has undergone further research and gained a vastly increased understanding of the neural system. This paper does not start from scratch but is based on Hawkins's approach, Hierarchical Temporal Memory (HTM), and adapts it to the Thai word segmentation problem.

Hierarchical Temporal Memory (HTM).
HTM [18] is a theory that was initiated by Jeff Hawkins in the book "On Intelligence" [19] in 2004. It was built to reflect the functioning of the neocortex from a neuroscience perspective. The HTM structure is similar to the neocortex in that it is a uniform hierarchy and works with invariant representations. It can be separated into multiple layers, and each layer can be broken into cortical columns. A cortical column consists of multiple neural cells. Sensory and neural cells are connected through synapses and dendrites. HTM makes predictions by using distal dendrites, while proximal dendrites receive feedforward input signals. A cell learns by creating and strengthening connections with other cells that are active at the same time, which is called Hebbian learning.
HTM provides a theoretical framework and the basic mechanisms of how the neocortex works. This research is inspired by HTM and simplifies it by using SDRs, the basic form of information in each layer of the brain.

Sparse Distributed Representations (SDRs).
SDRs [20] are the form in which information is stored and transferred, both feedforward and feedback, in HTM. An SDR contains only "0" (inactive) and "1" (active) bits. It is a large vector of bits with a small percentage active; this mirrors how the brain reduces energy use and performs inference with a small amount of activity. HTM uses the Spatial Pooler for pattern recognition and Temporal Pooling for sequence learning. SDRs are used as the memory structure of this research.

Definition 1. An SDR is an n-dimensional vector of binary elements:

$$x = [b_0, b_1, \ldots, b_{n-1}], \quad b_i \in \{0, 1\}$$

$w_x$ is the number of elements in $x$ that are active ("1") bits. The overlap is the number of bits that are "1" in the same positions of two vectors, and it determines the similarity between the two vectors. For example, two vectors X and Y with n = 40 and w = 5 that share two active positions have an overlap of 2, and their sparsity is s = w/n = 5/40 = 12.5%.

Matching depends on the number of possible unique SDRs, which is

$$\binom{n}{w} = \frac{n!}{w!\,(n - w)!}$$

If n = 2048 and the sparsity is about 10% (w = 200), then the SDR space is $1.01 \times 10^{283}$. This means that the probability of two random vectors being identical is

$$P(x = y) = \frac{1}{\binom{n}{w}}$$

Thus, with n = 2048 and w = 200, the probability of two identical random vectors is very close to zero.
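The quantities in this definition can be checked numerically. The following Python sketch stores an SDR as the set of its active bit positions; the two example vectors are hypothetical (they are not the vectors of the original example) and were chosen only so that n = 40, w = 5, and the overlap is 2.

```python
import math


def overlap(x, y):
    """Number of positions that are active in both SDRs."""
    return len(x & y)


def sparsity(x, n):
    """Fraction of active bits, s = w / n."""
    return len(x) / n


def unique_sdr_count(n, w):
    """Number of distinct SDRs with n bits and w active bits: C(n, w)."""
    return math.comb(n, w)


# Hypothetical example vectors, given as sets of active positions.
n = 40
x = frozenset({1, 7, 12, 25, 33})
y = frozenset({7, 9, 25, 30, 38})
print(overlap(x, y), sparsity(x, n))

# Size of the SDR space for n = 2048 and w = 200, shown as a power of ten.
space = unique_sdr_count(2048, 200)
print(f"about 10^{math.log10(space):.1f} unique SDRs")
```

Because the space is so large, the chance that two randomly chosen SDRs coincide, 1 / C(n, w), is negligible for these parameters.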

Union.
A useful characteristic of SDRs is the union, which stores multiple patterns in a single SDR by applying the OR operation to the vectors. Thus, it can reduce the size of the storage kept in the brain. However, this approach can increase the probability of false positives. The probability of a false positive can be written as

$$p_{fp} = \left(1 - \left(1 - \frac{w}{n}\right)^{M}\right)^{w}$$

For example, if n = 2048 and w = 200, storing M = 20 vectors, the chance of a false positive is 1 in $8.0 \times 10^{11}$. However, as the number of vectors in the union, M, increases, the union becomes saturated with "1" bits, and random vectors will mostly return a false positive match.
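The union and its false positive rate can be sketched in a few lines of Python. The sketch assumes that a candidate "matches" the union when all of its active bits are present in it, and it evaluates the approximation (1 - (1 - w/n)^M)^w; the function names and the random test patterns are illustrative assumptions.

```python
import random


def union(sdrs):
    """OR several SDRs (sets of active positions) into one SDR."""
    merged = set()
    for sdr in sdrs:
        merged |= sdr
    return merged


def matches(candidate, merged):
    """A candidate matches if every one of its active bits is in the union."""
    return candidate <= merged


def false_positive_probability(n, w, m):
    """Chance that a random SDR with w active bits matches a union of m SDRs."""
    return (1.0 - (1.0 - w / n) ** m) ** w


rng = random.Random(0)
patterns = [set(rng.sample(range(2048), 200)) for _ in range(20)]
merged = union(patterns)

# Every stored pattern is always recognized (no false negatives).
print(all(matches(p, merged) for p in patterns))

p_fp = false_positive_probability(2048, 200, 20)
print(f"false positive chance: about 1 in {1 / p_fp:.2e}")
```

Raising M in `false_positive_probability` shows the saturation effect: the probability approaches 1 as more vectors are merged into the same union.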

Encoder.
HTM can handle any input with the same algorithm because it uses an encoder to convert any signal into SDRs before sending it to HTM. Creating an encoder that preserves the important features passed to HTM is not an easy task. The choice of encoder is important because it affects the model's performance. The encoding process is analogous to passing visual information from the retina to the neocortex.
Examples of encoding can be found in [21]. However, encoding for NLP is not covered there; the reference instead points to cortical.io, a commercial API for encoding text into SDRs, for which only a concept is provided in [22]. In [23], the authors use HTM for document categorization. Their method uses TF-IDF, applies Latent Semantic Indexing (LSI), and encodes the numerical features into SDRs. In [24], anomaly detection in system logs is described that uses GloVe word embeddings [25], and the resulting numbers are encoded into SDRs.
As mentioned above, NLP work with HTM commonly encodes numbers into SDRs. This paper proposes a new encoder for NLP that turns not only each character into a representation but also the connections between characters.

Structure Memory.
SDRs are used as the memory structure of this research. The memory works as a hierarchy in which a representation in one layer can connect to representations in its own layer and in the layers above or below. For example, the encoding of a text is summarized in Figure 1; the first layer contains multiple cortical columns, and each column or representation corresponds to an active "1" bit in the SDR. A text can be encoded to any column depending on its encoding. This research proposes an encoding based on a hash function in which not only each character is encoded to a representation but each connection between representations is also a representation. The second layer represents the word and sentence levels, where representations and connections work similarly to the first layer. This word and sentence information is kept in an SDR, a large vector of bits with only a small number of active bits, based on the brain and inspired by Hawkins's approach.
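A hash-based encoder of this kind can be sketched as follows. The paper does not specify the hash function or how a connection is formed, so this Python sketch makes two assumptions: `zlib.crc32` stands in for the hash, and a "connection" is taken to be a pair of adjacent characters. All names are hypothetical.

```python
import zlib


def feature_bit(feature, n=2048):
    """Map a feature string to one of n bit positions with a stable hash."""
    return zlib.crc32(feature.encode("utf-8")) % n


def encode(text, n=2048):
    """Encode a text as the set of active bit positions of its SDR.

    One bit is activated per character and one per adjacent character pair
    (the assumed form of a "connection" between representations).
    """
    bits = set()
    for i, ch in enumerate(text):
        bits.add(feature_bit("c:" + ch, n))
        if i + 1 < len(text):
            bits.add(feature_bit("p:" + ch + text[i + 1], n))
    return bits


sdr = encode("แมว")
print(sorted(sdr))
```

A text of k characters activates at most 2k - 1 bits, so short contexts stay sparse for n = 2048. `zlib.crc32` is used rather than Python's built-in `hash` because the latter is randomized per process for strings.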
One problem with word segmentation is that if multiple possible words are found, a basic approach is choosing the longest one. However, the longest word might not be correct; thus, one solution is understanding its surrounding context.
Each word in the training data is encoded into an SDR together with its surrounding context, using the concept in Figure 1. This means that the algorithm learns the surrounding context of each word and compares similarity values to decide which word should be segmented, selecting the word with the highest similarity value. The estimation of a similarity value is described in the next section.
In this research, the number of active elements (w) and the number of bits or representations (n) are not fixed in advance. However, if the ratio of w to n is high, the SDRs might not be separable from one another. Conversely, if the ratio is very low, the memory is not used efficiently. In this evaluation, n is set to 2048.

Similarity.
Many machine learning approaches use weights or floating-point numbers to predict their output. Instead, SDRs use bits and logical operations to process each representation. This reduces complexity and processing time because bits are easy to manipulate and the CPU supports bit operations directly. Besides, modern memory structures also support storing information in bits; thus, the approach is easily adopted. The similarity is estimated simply by counting overlapping bits, as shown in Section 3, using AND or XOR operations between SDRs. For example, "ILIKECATS" is more similar to "ILIKEDOGS" than to "ILOVESONG" because it shares more overlapping bits with the former. The operation is very fast because it works at the bit level.
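The bit-level similarity can be illustrated with Python integers used as bit vectors. This is a sketch under assumptions: the character and adjacent-pair hashing with `zlib.crc32` is a stand-in for the paper's encoder, and the function names are hypothetical. AND followed by a population count gives the overlap, and XOR gives the number of differing bits.

```python
import zlib

N_BITS = 2048


def encode_bits(text, n=N_BITS):
    """Encode a text as an integer whose set bits form the SDR."""
    mask = 0
    for i, ch in enumerate(text):
        mask |= 1 << (zlib.crc32(("c:" + ch).encode("utf-8")) % n)
        if i + 1 < len(text):
            pair = text[i:i + 2]
            mask |= 1 << (zlib.crc32(("p:" + pair).encode("utf-8")) % n)
    return mask


def similarity(a, b):
    """Overlap: count of bits set in both SDRs (AND, then popcount)."""
    return bin(a & b).count("1")


def difference(a, b):
    """Hamming distance: count of bits set in exactly one SDR (XOR)."""
    return bin(a ^ b).count("1")


cats = encode_bits("ILIKECATS")
dogs = encode_bits("ILIKEDOGS")
song = encode_bits("ILOVESONG")
print(similarity(cats, dogs), similarity(cats, song))
```

"ILIKECATS" and "ILIKEDOGS" share the characters and pairs of the prefix "ILIKE", so their overlap is larger than the overlap with "ILOVESONG", which shares only a few characters and the pair "IL".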

Training.
Each word in the dictionary is kept in a HashMap structure that links the word to its SDR list. The word itself is encoded as the first SDR in its list. Next, every occurrence of the word together with its surrounding context is encoded and added to its SDR list. As shown in Figure 2, the words "perform" and "performance" are kept in a map structure that includes the SDRs of each word and of its surrounding contexts. The length of the surrounding context is set by a threshold (default = 16). The surrounding context can consist of suffix words, prefix words, or both.

Matching.
The THDICTSDR algorithm is based on a dictionary that is searched for matches in a text. Firstly, it identifies a list of possible words. If only one word is found, it selects that word. However, if multiple words are found, the algorithm chooses among them by their surrounding contexts, which have been trained and stored in SDRs. For instance, as shown in Figure 3, we consider the text "performanceatthemusic." The possible words are "perform" and "performance." Sample SDRs for the word "perform" in sentence form include "perform well in" and "perform the delic," while sample SDRs for the word "performance" include "performance at t" and "performance has." The SDR similarity between "performanceatthemusic" and "performance at t" is the highest value, indicating that the word "performance" is the correct choice.
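The training and matching steps can be sketched together in Python using the "perform"/"performance" example. This is a simplified illustration, not the authors' implementation: the `zlib.crc32` character-and-pair encoder, the function names, and the use of the raw overlap as the similarity value are assumptions; only the context length of 16 and the sample contexts come from the text.

```python
import zlib
from collections import defaultdict

N_BITS = 2048
CONTEXT_LEN = 16  # default surrounding-context threshold from the text


def encode(text, n=N_BITS):
    """Assumed encoder: one bit per character and per adjacent pair."""
    mask = 0
    for i, ch in enumerate(text):
        mask |= 1 << (zlib.crc32(("c:" + ch).encode("utf-8")) % n)
        if i + 1 < len(text):
            pair = text[i:i + 2]
            mask |= 1 << (zlib.crc32(("p:" + pair).encode("utf-8")) % n)
    return mask


def overlap(a, b):
    return bin(a & b).count("1")


def train(examples):
    """Build the map word -> [SDR of the word, SDRs of its contexts...]."""
    memory = defaultdict(list)
    for word, context in examples:
        if not memory[word]:
            memory[word].append(encode(word))  # the word itself comes first
        memory[word].append(encode(context[:CONTEXT_LEN]))
    return memory


def candidates_at(text, start, memory):
    """Dictionary words that match the text at position start."""
    return [w for w in memory if text.startswith(w, start)]


def choose_word(text, start, candidates, memory):
    """Pick the candidate whose stored SDRs best overlap the text window."""
    window = encode(text[start:start + CONTEXT_LEN])
    return max(
        candidates,
        key=lambda w: max((overlap(window, s) for s in memory[w]), default=0),
    )


memory = train([
    ("perform", "perform well in"),
    ("perform", "perform the delic"),
    ("performance", "performance at t"),
    ("performance", "performance has"),
])
text = "performanceatthemusic"
found = candidates_at(text, 0, memory)
print(found, "->", choose_word(text, 0, found, memory))
```

Both candidates match at position 0, and the stored context "performance at t" shares the most bits with the window "performanceatthe", so "performance" is selected.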
THSDR works like THDICTSDR, except that instead of finding a possible word list by exact lookup, it looks for matches by comparing SDRs with the SDRs of the words in the dictionary. The advantage of using SDRs is that they are fault-tolerant, meaning that even if some characters are missing or changed, the algorithm can still recognize the word. After identifying candidate words, the algorithm chooses among them by their surrounding contexts, which have been trained and stored in SDRs, as in the previous example.
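The fault tolerance described here can be illustrated with a small Python sketch in which a misspelled fragment is matched to the dictionary word with the largest SDR overlap. The encoder, the function names, and the three-word dictionary are assumptions for illustration; the actual THSDR candidate search is not specified at this level of detail.

```python
import zlib


def encode(text, n=2048):
    """Assumed encoder: one bit per character and per adjacent pair."""
    mask = 0
    for i, ch in enumerate(text):
        mask |= 1 << (zlib.crc32(("c:" + ch).encode("utf-8")) % n)
        if i + 1 < len(text):
            pair = text[i:i + 2]
            mask |= 1 << (zlib.crc32(("p:" + pair).encode("utf-8")) % n)
    return mask


def nearest_word(fragment, dictionary, n=2048):
    """Return (word, overlap) for the entry sharing the most bits."""
    query = encode(fragment, n)
    best_word, best_score = None, -1
    for word in dictionary:
        score = bin(query & encode(word, n)).count("1")
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# "perfurmance" contains one wrong character, as OCR noise might produce.
word, score = nearest_word("perfurmance", ["perform", "performance", "music"])
print(word, score)
```

An exact dictionary lookup would fail on "perfurmance", whereas the overlap still points to "performance" because most character and pair bits are unchanged. A practical system would likely precompute the dictionary SDRs instead of encoding each word per query.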

Handling Unknown Words.
In handling unknown words, the authors observed that unknown words tend to be short and to have a low frequency. Another observation is that if an unknown word is found, there is a high possibility that the preceding words were segmented incorrectly; thus, the algorithm needs to backtrack. Hence, once it finds an unknown word, it sets two anchor words around the unknown word by considering their length and frequency. For example, in Figure 4, the known words and their frequencies are ["I," 10], ["LO," 3], and ["BATS," 9]. If the text is "ILOVEBATS," the segmented words are "I," "LO," "VE," and "BATS," where "VE" is the unknown word. Nevertheless, we are sure that "I" and "BATS" are known words because they have a long length or a high frequency. Thus, the algorithm sets them as anchors. The unknown word "VE" then searches its neighbouring area and finds that "LO" has a low frequency; thus, "LO" and "VE" are merged into one word, "LOVE."
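The anchor-and-merge idea can be sketched in Python with the "ILOVEBATS" example. The thresholds `min_anchor_freq` and `min_anchor_len` and the function name are hypothetical, since the text does not state the exact criteria for a long or frequent word.

```python
def merge_unknown(tokens, freq, min_anchor_freq=5, min_anchor_len=4):
    """Merge unknown tokens with adjacent non-anchor tokens.

    A token is an anchor if it is known and is either frequent or long.
    Both thresholds are illustrative assumptions.
    """
    def is_anchor(tok):
        return tok in freq and (
            freq[tok] >= min_anchor_freq or len(tok) >= min_anchor_len
        )

    merged = []
    for tok in tokens:
        can_merge = (
            merged
            and not is_anchor(tok)
            and not is_anchor(merged[-1])
            # at least one side must be unknown to trigger a merge
            and (tok not in freq or merged[-1] not in freq)
        )
        if can_merge:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


freq = {"I": 10, "LO": 3, "BATS": 9}
print(merge_unknown(["I", "LO", "VE", "BATS"], freq))
```

Here "I" and "BATS" act as anchors, while the low-frequency "LO" and the unknown "VE" between them are joined into "LOVE".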

n-gram.
We use n-grams to improve accuracy because the authors also found inconsistently segmented words in the training data. This is the same problem mentioned in [15]: segmentation suffers from the lack of a clear definition, and even segmentations produced by the same person can be inconsistent. Thus, the co-occurrence value of the n-gram is checked. For example, for the two words "ice" and "cream," if the frequency of "icecream" is higher than the co-occurrence value of "ice cream," then the two words "ice cream" are merged into "icecream." This research uses 2-grams to select the word.
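The 2-gram check can be written as a short Python sketch. The counts below are made-up illustration values, and the function and variable names are hypothetical; in practice both tables would be collected from the training data.

```python
def merge_by_bigram(tokens, word_freq, bigram_freq):
    """Merge adjacent words when the joined form is more frequent than
    the co-occurrence of the two separate words."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            first, second = tokens[i], tokens[i + 1]
            joined = first + second
            if word_freq.get(joined, 0) > bigram_freq.get((first, second), 0):
                merged.append(joined)
                i += 2  # both words were consumed
                continue
        merged.append(tokens[i])
        i += 1
    return merged


# Illustrative counts only.
word_freq = {"ice": 40, "cream": 25, "icecream": 12}
bigram_freq = {("ice", "cream"): 3}
print(merge_by_bigram(["ice", "cream", "is", "cold"], word_freq, bigram_freq))
```

With these counts, "icecream" (12) is more frequent than the pair "ice cream" (3), so the two words are merged; if the pair were more frequent, they would be left separate.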

Evaluation
This study evaluates the newly proposed methods by comparing them with the dictionary-based longest matching method and with newmm, which combines dictionary-based maximum matching with Thai Character Clusters [26]. The evaluation also compares the proposed methods with Deepcut, which is the state-of-the-art approach. The evaluation is conducted using the BEST2010 (BEST) and LST20 (LST) corpora on an ASUS TUF A15 laptop with an AMD Ryzen 7 5800H (8 CPU cores, 16 threads), 32 GB of RAM, and an RTX 3060 GPU with 6 GB. BEST2010 [27] comprises 415 Thai documents, about 5.1 million words, and a vocabulary of 104 k, covering four domains, namely, articles, news, encyclopedias, and novels. The LST20 [28] corpus, on the other hand, provides five layers of linguistic annotation, including word boundaries, POS tagging, named entities, clause boundaries, and sentence boundaries. It includes 3,164,002 words, 288,020 named entities, 248,181 clauses, and 74,180 sentences. Each dataset is used separately, with 90% for training and 10% for testing. Deepcut is trained from scratch on this 90% training set because the pretrained version of Deepcut may have included the test data. Likewise, THSDR is based on the Lexitron Thai-Eng Dictionary and then learns from the training data to create SDRs.
Five metrics are used to evaluate the new model at the word level: Precision, Recall, F1-Score, Intersection over Union (IoU), and processing time. Precision, Recall, and F1-Score are calculated as illustrated by the example in Figure 5 and Equations (6)-(8).
Two parameters were evaluated. The first parameter was the length of the surrounding context, which was set to 8, 16, and 32. The second parameter was the size of the SDR, which was set to 1024, 2048, and 4096. A surrounding context length of 16 and an SDR size of 2048 were selected because they resulted in the best performance in the experiment.
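$$\text{Precision} = \frac{\text{the number of correct words}}{\text{the number of word predictions}} = \frac{TP}{TP + FP} \tag{6}$$

$$\text{Recall} = \frac{\text{the number of correct words}}{\text{the number of words in the ground truth}} = \frac{TP}{TP + FN} \tag{7}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$

In the equations above, TP (True Positive) represents the number of correctly identified word segments, while FP (False Positive) represents the number of misrecognized word segments. FN (False Negative) represents the number of unrecognized word segments.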
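The word-level metrics can be computed by comparing character spans, as in the following Python sketch. A predicted word counts as correct only if both of its boundaries agree with the ground truth. The IoU shown here, the number of shared spans divided by the number of distinct spans in either segmentation, is an assumption, because the text does not define how IoU is computed; the function names are hypothetical.

```python
def spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    result = set()
    pos = 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def word_metrics(predicted, truth):
    """Return precision, recall, F1, and (assumed) IoU at the word level."""
    pred_spans, true_spans = spans(predicted), spans(truth)
    tp = len(pred_spans & true_spans)  # words with both boundaries correct
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    all_spans = pred_spans | true_spans
    iou = tp / len(all_spans) if all_spans else 0.0
    return precision, recall, f1, iou


truth = ["I", "LOVE", "BATS"]
predicted = ["I", "LO", "VE", "BATS"]
precision, recall, f1, iou = word_metrics(predicted, truth)
print(precision, recall, f1, iou)
```

In this example, two of the four predicted words are correct (precision 2/4) and two of the three true words are found (recall 2/3), which gives F1 = 4/7.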
In this evaluation, we tested the performance of the first method, which uses a dictionary-based approach, under three different scenarios. In the first scenario, the method learned only from the training data and a common dictionary. The second scenario assumed that the method could learn all the words in the sentences, as in applications that interact with users, or that it had learned enough vocabulary to cover them. In the third scenario, the method had knowledge of all words and sentence connections. The second method was evaluated to demonstrate its fault tolerance to noise. We generated noise from the BEST dataset at four different levels, 1%, 3%, 5%, and 10%, and measured the performance.
6.1. THDICTSDR Evaluation.
The first evaluation, shown in Table 1, indicates that the Precision, Recall, IoU, and F1-score on the BEST dataset for the longest matching and newmm methods are significantly lower than for Deepcut and THDICTSDR. Although THDICTSDR has a higher recall than Deepcut, at 96.89%, its precision, IoU, and F1-score are slightly lower. Additionally, the processing time of THDICTSDR is higher than that of Deepcut because it has multiple rules to check. However, these results show that THDICTSDR is comparable to state-of-the-art methods and offers a different approach. It should be noted that THDICTSDR may have lower precision because of the unknown words problem, which even the proposed method cannot handle perfectly, resulting in some incorrect segmentation.
The second evaluation was conducted on another dataset, LST20, and is shown in Table 2. The results for THDICTSDR were similar to those in the first evaluation, with high recall but lower precision. While the performance of THDICTSDR was slightly lower than that of Deepcut, it still performed better than the longest matching and newmm methods.
Before correctly segmenting words, humans need to understand their vocabulary, and similarly, algorithms need a good grasp of the vocabulary for accurate segmentation. If any unknown vocabulary is encountered during segmentation, the software can notify the user and prompt for approval of a new word. This approach is different from the labelling of segmented words used in deep learning to learn the error model. To examine the case in which all the vocabulary has been learned, the third evaluation, shown in Table 3, was conducted. THDICTSDR outperformed Deepcut on recall, IoU, and F1-score, achieving 97.50%, 0.94, and 96.78%, respectively, on the BEST dataset, and 97.50% on recall for the LST dataset. However, THDICTSDR still exhibits lower precision, IoU, and recall on the LST dataset due to its handling of unknown words.
In another case, we examine how the algorithm performs when the brain has learnt the correct words and word connections. Here, the algorithm learns from the training data and is tested on the same training data, under the assumption that a human has learnt all words and connections. As a result, in Table 4, THDICTSDR gives a considerably higher F1-score, 99.48% on BEST and 99.37% on LST, than Deepcut at 97.65%. In Table 5, THSDR shows higher fault tolerance than Deepcut, the deep learning approach, in all cases.
Table 6 evaluates SDR sizes of 1024, 2048, and 4096. The results indicate that there is no significant difference in performance between the SDR sizes, including their processing time. Therefore, altering the SDR size has minimal impact on overall performance. This suggests that increasing the SDR size to enhance the capacity for the union operation in merging may be a viable approach to improve performance without sacrificing processing time.
Table 7 displays the results obtained by setting the word length threshold to 64, 32, 16, and 8. The findings suggest that increasing the word length can adversely affect performance because too many words are encoded into the SDR, leading to multiple matches. Conversely, reducing the word length can also have

Conclusion
This research presents two new methods that use SDRs to replicate learning from the brain. Both methods exhibit higher accuracy than other dictionary-based methods. The first method also yields results comparable to the state-of-the-art Deepcut, and in some cases, even better. However, its

Future Work
Many challenges remain, and future work can proceed as follows.
(1) Encoding SDRs for NLP currently relies on hashing characters and their connections to reduce size and processing time. To improve performance, it may be possible to use syllable patterns instead of characters.
(2) The processing time of the proposed methods is still slower than the state of the art. One solution could be to use parallel processing and union vectors to speed up the computations [6].
(3) The second method, THSDR, not only performs word segmentation but is also capable of correcting words, addressing word segmentation and spelling check in a hybrid manner. This hybrid approach has not been found in previous Thai language research. However, it falls outside the scope of this paper.
(4) While the paper only employs SDRs from HTM, it would be valuable to explore the potential of other HTM techniques, such as the spatial pooler and temporal memory, for learning and predicting segmentation.

Data Availability
The data used to support the findings of this study are available at https://aiforthai.in.th/corpus.php.

Conflicts of Interest
The authors declare that they have no conflicts of interest.